Connect to Google Drive and set the working directory

!! Skip this section if you are running locally !!

  1. Add a shortcut to the working directory ('IDPCode') in your Drive, as shown in GDriveConnect.png.
  2. Run the command below to connect to Google Drive:
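The mount command itself is not shown in this excerpt; a minimal sketch, assuming Colab's `google.colab.drive` module and that the 'IDPCode' shortcut resolves to `MyDrive/IDPCode` (the exact path is an assumption):

```python
import os

try:
    # Inside Colab: mount Google Drive and enter the shortcut directory.
    # 'MyDrive/IDPCode' is an assumption based on the shortcut name above.
    from google.colab import drive
    drive.mount('/content/drive')
    os.chdir('/content/drive/MyDrive/IDPCode')
except ImportError:
    # Not running in Colab (e.g. a local run): keep the current directory.
    pass

print(os.getcwd())
```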

!! Start from here if you are running locally !!

Initial Info about the dataset

The dataset in 'All_Papers_In_Plain_Text_TIKA.pkl' contains the text extracted page by page from each document. Each document occupies one row, with as many columns as it has pages. No data preprocessing has been applied to the data in 'All_Papers_In_Plain_Text_TIKA.pkl'; it is not even normalized, and many documents have just one page. Our purpose here is to test the accuracy against different data preprocessing steps, keeping them as simple as possible. Afterwards, we will run further tests on normalized data. To summarize:

You can skip the '1. Data pre-processing' section and load the pre-processed data with the code below:
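The loading cell is not shown in this outline, and the pre-processed file's name is not given, so the example below only demonstrates the pickle save/load mechanism on a tiny stand-in frame shaped like the dataset (one row per document, one column per page):

```python
import pandas as pd

# Illustrative only: the notebook's real pre-processed pickle name is not
# shown here. This demonstrates the mechanism with a tiny stand-in frame.
df = pd.DataFrame(
    {'page_1': ['text of page one', 'another doc'],
     'page_2': ['text of page two', None]}
)
df.to_pickle('preprocessed_demo.pkl')
loaded = pd.read_pickle('preprocessed_demo.pkl')
print(loaded.equals(df))  # True
```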

1. Data pre-processing

1.1. Clean text
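The cleaning cell itself is not included here; a minimal sketch of typical cleaning steps (lowercasing, stripping URLs, digits, and punctuation), which may differ from the notebook's actual rules:

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, drop URLs/digits/punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r'https?://\S+', ' ', text)   # remove URLs
    text = re.sub(r'[^a-z\s]', ' ', text)       # remove digits and punctuation
    return re.sub(r'\s+', ' ', text).strip()    # collapse extra whitespace

print(clean_text('See https://sraf.nd.edu/ for 10+ stop-word lists!'))
# see for stop word lists
```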

1.2. Detect document language and remove non-English Sentences

1.3. Stopwords

1.3.1 Take stopwords list from https://sraf.nd.edu/textual-analysis/resources/

1.3.2 Take stopwords list from MALLET

1.3.3 Combine stopwords

1.3.4 Save stopwords to investigate later manually

1.3.5 Remove stop words
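The steps in 1.3.1-1.3.5 (load two lists, combine, save for inspection, then filter) can be sketched as follows; the word sets here are tiny placeholders standing in for the real sraf.nd.edu and MALLET lists:

```python
# Placeholders standing in for the sraf.nd.edu and MALLET stopword
# files loaded in 1.3.1-1.3.2.
sraf_stopwords = {'the', 'and', 'of'}
mallet_stopwords = {'the', 'a', 'to'}

# 1.3.3: combine (a set union removes duplicates across the two lists).
stopwords = sraf_stopwords | mallet_stopwords

# 1.3.4: persist for later manual inspection.
with open('combined_stopwords.txt', 'w') as f:
    f.write('\n'.join(sorted(stopwords)))

# 1.3.5: remove stop words from a tokenized document.
tokens = ['the', 'accuracy', 'of', 'a', 'topic', 'model']
filtered = [t for t in tokens if t not in stopwords]
print(filtered)  # ['accuracy', 'topic', 'model']
```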

1.4. Lemmatization

(Skipped) 1.5. Stemming

Stemming may lead to confusing tokens; think about this again!

2. Prepare Test Content

Normalize Data

3. Remove non-relevant data

4. Visualize Data

Unigrams

Bigrams
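A sketch of how the unigram and bigram frequencies to visualize can be counted, using a toy token list (the notebook's actual counting code is not shown here):

```python
from collections import Counter

tokens = ['topic', 'model', 'topic', 'model', 'accuracy']

# Unigram frequencies.
unigrams = Counter(tokens)

# Bigram frequencies: pair each token with its successor.
bigrams = Counter(zip(tokens, tokens[1:]))

print(unigrams.most_common(2))  # [('topic', 2), ('model', 2)]
print(bigrams.most_common(1))   # [(('topic', 'model'), 2)]
```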

5. LDA Topic Modeling

5.1. Visualize topics

By default the topics are projected to the 2D plane using PCoA on a distance matrix created using the Jensen-Shannon divergence on the topic-term distributions. You can pass in a different multidimensional scaling function via the mds parameter. In addition to pcoa, other provided options are tsne and mmds which operate on the same JS-divergence distance matrix. Both tsne and mmds require that you have sklearn installed.
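The Jensen-Shannon divergence behind that distance matrix can be sketched in plain Python. pyLDAvis computes this internally; this is only to illustrate the metric itself:

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.7, 0.2, 0.1]   # topic-term distribution of topic A
q = [0.1, 0.3, 0.6]   # topic-term distribution of topic B
print(js_divergence(p, q))
print(js_divergence(p, p))  # 0.0 for identical distributions
```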

Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis

Answer: https://stackoverflow.com/questions/50923430/what-does-the-parameter-mds-mean-in-the-pyldavis-sklearn-prepare-function

We used tsne below. Check the other options: https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb
Also check how LDAvis works: https://cran.r-project.org/web/packages/LDAvis/vignettes/details.pdf
The original LDAvis paper: https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf
A presentation on why to visualize topic models: https://speakerdeck.com/bmabey/visualizing-topic-models

5.2. Extract Topics

5.3. Make Prediction

5.4. Prepare pre-trained model

5.4.1. Save the model

Save the model to use as a pre-trained model on the https://simple-recommender.herokuapp.com/ website.
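A sketch of persisting the trained artifacts with pickle; the actual notebook may instead use the model's own save method (e.g. gensim's `lda.save`), and the filename and bundle layout below are illustrative:

```python
import pickle

# Placeholders standing in for the trained LDA model and its dictionary;
# the real objects would be the notebook's fitted model and id2word map.
model_bundle = {
    'lda_model': {'num_topics': 20},          # stand-in for the real model
    'dictionary': {0: 'topic', 1: 'model'},   # stand-in for the id2word map
}

with open('pretrained_model.pkl', 'wb') as f:
    pickle.dump(model_bundle, f)

with open('pretrained_model.pkl', 'rb') as f:
    restored = pickle.load(f)
print(restored['lda_model']['num_topics'])  # 20
```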

5.4.2. Add Title, Author, and metadata to the core model

Use the relevant fields from RELAVENT_DATA.xlsx

5.5. Measure similarity
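The outline does not show which similarity metric the notebook uses; a common choice for comparing a query document's topic distribution against the corpus is cosine similarity, sketched here as an assumption:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [0.8, 0.1, 0.1]                 # topic mix of the query paper
corpus = {'paper_a': [0.7, 0.2, 0.1],   # candidate papers
          'paper_b': [0.1, 0.1, 0.8]}

# Rank candidates by similarity to the query (most similar first).
ranked = sorted(corpus, key=lambda k: cosine_similarity(query, corpus[k]),
                reverse=True)
print(ranked)  # ['paper_a', 'paper_b']
```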

5.6. Make single prediction

6. Calculate Accuracy

6.1. Calculate Top-5 Accuracy

6.2. Calculate Top-1, Top-5, Top-20, and Top-100 Accuracy
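The Top-k accuracy used in these two subsections can be sketched as: a query counts as correct if the true paper appears among the model's k highest-ranked suggestions. A toy-scale version of the Top-1/5/20/100 sweep:

```python
def top_k_accuracy(ranked_predictions, truths, k):
    """Fraction of queries whose true item is in the top-k ranked list."""
    hits = sum(1 for ranking, truth in zip(ranked_predictions, truths)
               if truth in ranking[:k])
    return hits / len(truths)

# Toy rankings: each inner list is ordered best-first for one query.
rankings = [['a', 'b', 'c'], ['b', 'c', 'a'], ['c', 'a', 'b']]
truths = ['a', 'a', 'a']

for k in (1, 2, 3):  # mirrors the Top-1/5/20/100 sweep, at toy scale
    print(k, top_k_accuracy(rankings, truths, k))
```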